
fix: memory leak in the webhook TLS healthcheck #2690

Merged: sozercan merged 2 commits into open-policy-agent:master from dethi:tls-webhook-healtcheck-memory-leak on Apr 14, 2023

Conversation


@dethi dethi commented Apr 8, 2023

What this PR does / why we need it:

  • The resp.Body was never closed, so one connection was leaked on each execution of the health check.

  • Create a new transport based on the default transport, to inherit its default timeouts. Most importantly, this ensures that there is a default TLSHandshakeTimeout (10s) and dial timeout (30s).

  • Disable HTTP keep-alive, to avoid reusing the same HTTP connection. Otherwise, the check may fail when the cert is rotated (the new cert on disk wouldn't match the cert attached to the reused connection opened earlier).

Which issue(s) this PR fixes

Fixes #2654

Special notes for your reviewer:

@dethi dethi force-pushed the tls-webhook-healtcheck-memory-leak branch from 648f541 to d98533c Compare April 8, 2023 12:09
@dethi dethi changed the title webhook: fix memory leak in the TLS healthcheck fix: memory leak in the webhook TLS healthcheck Apr 8, 2023
- The resp.Body was never closed, so one connection was leaked on each
  execution of the health check.

- Create a new transport based on the default transport, to inherit
  its default timeouts. Most importantly, this ensures that there
  is a default TLSHandshakeTimeout (10s) and dial timeout (30s).

- Disable HTTP keep-alive, to avoid reusing the same HTTP connection.
  Otherwise, we may fail the healthcheck when the cert is rotated (the new
  cert on disk wouldn't match the cert attached to the reused connection
  opened earlier).

Fix open-policy-agent#2654

Signed-off-by: Thibault Deutsch <thibault@arista.com>
@dethi dethi force-pushed the tls-webhook-healtcheck-memory-leak branch from d98533c to 69300f5 Compare April 8, 2023 12:11

@acpana acpana left a comment


(first, thanks for all the engagement on the issue and for opening a PR 💯 )

quick question re keep alives;

Comment on lines +23 to +25
// disable keep alives to ensure that http connection aren't reused, otherwise the check may
// fail if the cert was rotated in between
tr.DisableKeepAlives = true

@acpana acpana Apr 10, 2023


Would keep-alives here just cause some network flakiness? In other words, if the checks retry once the certs have been rotated, would this still be a problem?


@dethi dethi Apr 10, 2023


I don't think it would be just flakiness. I haven't had time to validate my theory, but here is the scenario I have in mind (with keep-alive enabled):

  • The first health check passes, with certificate A. Because we drain the body before closing it, the connection is kept open (that's the default behaviour of the Go HTTP client, as long as the server also allows keep-alive).
  • Then the certificate is renewed. Certificate A is replaced by certificate B on disk, and the transport layer of the HTTP server now uses certificate B for new connections.
  • A new health check starts. The previous connection is still open, in the connection pool. The Go HTTP client selects it and makes a new request. We get a response. When we check the response, we compare what we have on disk (certificate B) with the peer cert associated with the connection (certificate A). The check fails.
  • This repeats indefinitely until the connection is closed, which may never happen (or only after X minutes, when the maximum connection lifetime is exceeded), because we always properly drain and close the body, so the HTTP client should always keep the connection open.
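
The reuse behaviour this scenario relies on can be observed with `net/http/httptrace`. The sketch below is only an illustration under stated assumptions: `reusedOnSecondRequest` is a hypothetical helper, and an `httptest` TLS server stands in for the webhook. It issues two requests, drains and closes each body, and reports whether the second request was served over a pooled connection.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
)

// reusedOnSecondRequest is a hypothetical helper: it sends two GET requests
// and returns whether the second one reused an existing connection.
func reusedOnSecondRequest(disableKeepAlives bool) (bool, error) {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	defer srv.Close()

	pool := x509.NewCertPool()
	pool.AddCert(srv.Certificate())

	tr := http.DefaultTransport.(*http.Transport).Clone()
	tr.TLSClientConfig = &tls.Config{RootCAs: pool}
	tr.DisableKeepAlives = disableKeepAlives
	defer tr.CloseIdleConnections()
	client := &http.Client{Transport: tr}

	reused := false
	for i := 0; i < 2; i++ {
		// GotConn fires once per request; after the loop, reused holds the
		// value observed for the second request.
		trace := &httptrace.ClientTrace{
			GotConn: func(info httptrace.GotConnInfo) { reused = info.Reused },
		}
		req, err := http.NewRequest(http.MethodGet, srv.URL, nil)
		if err != nil {
			return false, err
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

		resp, err := client.Do(req)
		if err != nil {
			return false, err
		}
		// Drain and close the body so the connection can go back to the pool.
		_, _ = io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	return reused, nil
}

func main() {
	for _, disabled := range []bool{false, true} {
		reused, err := reusedOnSecondRequest(disabled)
		fmt.Printf("DisableKeepAlives=%v reused=%v err=%v\n", disabled, reused, err)
	}
}
```

With keep-alives enabled, the second request should report a reused connection, so its TLS state still reflects the handshake done for the first request. With `DisableKeepAlives` set, each request dials and handshakes again.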


That seems likely. IIRC certs are only used for negotiating a session key, so rotating a cert wouldn't necessarily break a pre-existing connection.

@codecov-commenter

Codecov Report

Patch coverage: 37.62% and project coverage change: -0.55 ⚠️

Comparison is base (143e8cf) 53.27% compared to head (69300f5) 52.72%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2690      +/-   ##
==========================================
- Coverage   53.27%   52.72%   -0.55%     
==========================================
  Files         120      123       +3     
  Lines       10594    10941     +347     
==========================================
+ Hits         5644     5769     +125     
- Misses       4515     4715     +200     
- Partials      435      457      +22     
Flag Coverage Δ
unittests 52.72% <37.62%> (-0.55%) ⬇️


Impacted Files Coverage Δ
apis/status/v1beta1/zz_generated.deepcopy.go 0.00% <0.00%> (ø)
pkg/webhook/health_check.go 0.00% <0.00%> (ø)
pkg/webhook/policy.go 37.76% <0.00%> (-1.32%) ⬇️
...status/v1beta1/expansiontemplatepodstatus_types.go 8.69% <8.69%> (ø)
apis/status/v1beta1/util.go 76.31% <25.00%> (ø)
pkg/controller/expansion/stats_reporter.go 54.05% <45.83%> (ø)
pkg/controller/expansion/expansion_controller.go 55.47% <61.22%> (ø)
pkg/readiness/ready_tracker.go 69.19% <65.30%> (-0.46%) ⬇️
apis/status/v1beta1/constraintpodstatus_types.go 80.64% <100.00%> (ø)
...tatus/v1beta1/constrainttemplatepodstatus_types.go 73.91% <100.00%> (ø)
... and 3 more

... and 3 files with indirect coverage changes



@maxsmythe maxsmythe left a comment


LGTM, thank you for finding/fixing this!

@maxsmythe maxsmythe requested review from ritazh and sozercan April 10, 2023 21:27
@sozercan sozercan added this to the v3.12.1 milestone Apr 11, 2023
@sozercan sozercan merged commit 1215957 into open-policy-agent:master Apr 14, 2023
@dethi dethi deleted the tls-webhook-healtcheck-memory-leak branch April 14, 2023 18:14
davis-haba pushed a commit to davis-haba/gatekeeper that referenced this pull request Apr 18, 2023
Co-authored-by: Sertaç Özercan <852750+sozercan@users.noreply.github.com>
sozercan added a commit that referenced this pull request Apr 25, 2023
Co-authored-by: Thibault Deutsch <thibault@arista.com>
salaxander pushed a commit to salaxander/gatekeeper that referenced this pull request Apr 27, 2023
Co-authored-by: Sertaç Özercan <852750+sozercan@users.noreply.github.com>
Signed-off-by: Xander Grzywinski <xandergr@microsoft.com>
@ritazh ritazh modified the milestones: v3.12.1, v3.13.0 Jul 13, 2023

Development

Successfully merging this pull request may close these issues.

gatekeeper-controller-manager is leaking memory

6 participants